Creating data visualisation beyond default
In
Take-home Exercise 1, demographic analysis is required to reveal the
demographic of the city of Engagement, Ohio USA by using
tidyverse packages.
Dataset of Participants can be downloaded through the link: participant.csv
In this article, with ggplot2, histogram and boxplots
combined with violin plots. Besides, additional packages:
webr and ggtern are used to create more
complicated plots: pie chart combined with donut plot and ternary plot
respectively.
Before we start to explore and visualize the data, required packages should be installed and the dataset should be imported.
tidyverse is required to visualize the data. Besides, webr is also used in this task to create ternary plot. The code below installs and loads required packages.
packages = c('tidyverse','webr', 'ggtern')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports Participants.csv from the
data folder into R and save it as a data frame called participants.
participants <- read_csv("data/Participants.csv")
Histograms can visualize the distribution of variables by categories. In this task, the distribution of age can be visualized between “have kids” and “don’t have kids” groups with context of all participants. The sketch is shown below:
Before we plot our graph, we should change the type and name of variable ‘haveKids’ with the code below:
p<- participants
p$haveKids <- as.character(p$haveKids)
p['haveKids'][p['haveKids'] == 'TRUE'] <- "Have kids"
p['haveKids'][p['haveKids'] == 'FALSE'] <- "Dont Have kids"
With code below, we created histograms of the distribution of age between “have kids” and “don’t have kids” groups.
ggplot(p, aes(x = age))+
geom_histogram()+
facet_wrap(~ haveKids)
However, it is not informative since it cannot show us the proportion of different groups with context of all participants as a whole.
With code below, we added a ‘background’, revealing the distribution of age between two groups (with/without kids) with reference to all participants. Besides, title, xlab, ylab are also manipulated to make the graph more informative.
p_bg <- p[, -3]
ggplot(p, aes(x = age, fill = haveKids)) +
ggtitle("Distribution of Age in People Who Have/Don't Have Kids
\n(With Background of the Whole Group)") +
geom_histogram(data = p_bg, fill = "grey", alpha = 0.5) +
geom_histogram(colour = "black") +
facet_wrap(~ haveKids) +
guides(fill = "none") +
xlab("Age") +
ylab("No. of \nPeople") +
theme_bw()
In this makeover histogram, we can see the distribution of age doesn’t have a specific trend in both ‘have kids’ and ‘don’t have kids’ group. The uneven distribution of age in the whole group is mainly attributed by the uneven distribution of age in ‘don’t have kids’ group (see age=30-32 and 42-44).
With boxplot, we can see the distribution (as well as mean value) of age with different categories. For example, we can have the comparison between ‘have kids’ and ‘don’t have kids’ participants in different education levels as well as the comparison between participants with different education levels in ‘have kids’ and ‘don’t have kids’ groups.
The sketch of proposed design is shown below:
With the code below, we can create different groups of boxplot showing the distribution of age and compare the mean in different levels. However, the distribution of age cannot be showed very clearly with only boxplot. The mean values are not very easy to be compared. Besides, xlabs are also overlapped.
ggplot(data=p,
aes(y = age, x= haveKids)) +
facet_grid(~ educationLevel) +
geom_boxplot()
ggplot(data=p,
aes(y = age, x= educationLevel)) +
facet_grid(haveKids ~.) +
geom_boxplot()
With the code below, violin plot is added, showing a very detailed distribution trend of age in each group. Besides, title, xlab and ylab are improved. The point to show mean value and the notch are also added.
ggplot(data=p,
aes(y = age, x= haveKids)) +
ggtitle("Distribution of Age in Different Education Level
\n(Subgrouped by having kids or not) ") +
facet_grid(~ educationLevel) +
geom_violin(fill="light blue") +
geom_boxplot(notch=TRUE, alpha=0.5) +
xlab("Have/Don't Have Kids") +
ylab("Age") +
theme(axis.title.y=element_text(angle = 0))+
theme(axis.text.x=element_text(angle = 10))+
stat_summary(geom = "point",
fun="mean",
colour ="red",
size=2)
ggplot(data=p,
aes(y = age, x= educationLevel)) +
ggtitle("Distribution of Age of People Who Have Kids/Don't Have Kids
\n(Subgrouped by Education Level) ") +
facet_grid(haveKids ~.) +
geom_violin(fill="light blue") +
geom_boxplot(notch=TRUE, alpha=0.5) +
xlab("Education Level") +
ylab("Age") +
theme(axis.title.y=element_text(angle = 0))+
stat_summary(geom = "point",
fun="mean",
colour ="red",
size=2)
In the first Plot, we can see that except for bachelors group, mean age of participants who have kids is smaller than those who don’t have kids, and it is most obvious in low education level group.
In the second Plot, we can see that in the group of ‘Don’t have kids’, participants with bachelor degree have the smallest value of age, while in the group of ‘Have kids’, participants with low education level have the smallest value of age.
Besides, violin plot clearly represents the distribution of age in each education level group
Ternary plot can help use to see the ratios of the three variables as positions in an equilateral triangle. In this task, it can graphically depict the proportions of different age groups of different education level differenciated using points with different color and size.
Young Adults: 18-30 years old
Middle-aged Adults: 31-45 years old
Old Adults: 46-60 years old
age_group<-p [-6:-7]
age_group$Age_Group<-cut(age_group$age,
breaks = c(18,31,46,61),
labels = c("YoungAdults","MiddleAgedAdults", "OldAdults"))
head(age_group)
# A tibble: 6 x 6
participantId householdSize haveKids age educationLevel Age_Group
<dbl> <dbl> <chr> <dbl> <chr> <fct>
1 0 3 Have kids 36 HighSchoolOrC~ MiddleAg~
2 1 3 Have kids 25 HighSchoolOrC~ YoungAdu~
3 2 3 Have kids 35 HighSchoolOrC~ MiddleAg~
4 3 3 Have kids 21 HighSchoolOrC~ YoungAdu~
5 4 3 Have kids 43 Bachelors MiddleAg~
6 5 3 Have kids 32 HighSchoolOrC~ MiddleAg~
age_group_el <- age_group %>%
select(c("participantId","educationLevel","Age_Group")) %>%
group_by(educationLevel, Age_Group) %>%
summarise(population=n())
head(age_group_el)
# A tibble: 6 x 3
# Groups: educationLevel [2]
educationLevel Age_Group population
<chr> <fct> <int>
1 Bachelors YoungAdults 81
2 Bachelors MiddleAgedAdults 72
3 Bachelors OldAdults 74
4 Bachelors <NA> 5
5 Graduate YoungAdults 50
6 Graduate MiddleAgedAdults 59
age_group_el_new <- age_group_el %>%
mutate(i = row_number()) %>%
spread(Age_Group, population) %>%
select(-i)
age_group_el_new <- age_group_el_new[1:4]
age_group_el_new
# A tibble: 16 x 4
# Groups: educationLevel [4]
educationLevel YoungAdults MiddleAgedAdults OldAdults
<chr> <int> <int> <int>
1 Bachelors 81 NA NA
2 Bachelors NA 72 NA
3 Bachelors NA NA 74
4 Bachelors NA NA NA
5 Graduate 50 NA NA
6 Graduate NA 59 NA
7 Graduate NA NA 57
8 Graduate NA NA NA
9 HighSchoolOrCollege 159 NA NA
10 HighSchoolOrCollege NA 173 NA
11 HighSchoolOrCollege NA NA 181
12 HighSchoolOrCollege NA NA NA
13 Low 22 NA NA
14 Low NA 32 NA
15 Low NA NA 27
16 Low NA NA NA
age_group_el_new[is.na(age_group_el_new)] <- 0
age_group_el_final <-age_group_el_new %>%
group_by(educationLevel) %>%
summarise(YoungAdults=sum(YoungAdults),MiddleAgedAdults=sum(MiddleAgedAdults),
OldAdults=sum(OldAdults)) %>%
mutate(Population= YoungAdults + MiddleAgedAdults + OldAdults)
head(age_group_el_final)
# A tibble: 4 x 5
educationLevel YoungAdults MiddleAgedAdults OldAdults Population
<chr> <int> <int> <int> <int>
1 Bachelors 81 72 74 227
2 Graduate 50 59 57 166
3 HighSchoolOrColle~ 159 173 181 513
4 Low 22 32 27 81
With the following code, ternary plot could be built.
However, as we can see, there are lots of flaws to be improved:
The points are overlapped: can be solved by changing the alpha value and xlim/ ylim.
The legend should be moved to the bottom.
The angle of title of each axis should be adjusted.
The title of the ternary should be added.
The arrow in each axis should be added to make the plot easier to read.
ggtern(age_group_el_final, aes(x=YoungAdults,y=MiddleAgedAdults, z=OldAdults,
fill= educationLevel, size=Population)) +
geom_point(shape=21) +
guides(size = FALSE) +
theme_bw()
With the following code, flaws of the original design are solved.
ggtern(age_group_el_final, aes(x=YoungAdults,y=MiddleAgedAdults, z=OldAdults,
fill= educationLevel, size=Population)) +
geom_point(alpha=0.7,shape=21, color="white") +
guides(size = FALSE) +
theme_bw() +
ggtitle("Age Composition in Different Educational Level")+
scale_T_continuous(limits=c(0.25,0.5)) +
scale_L_continuous(limits=c(0.25,0.5)) +
scale_R_continuous(limits=c(0.25,0.5))+
theme_showarrows() +
labs( x = "Young Adults",
xarrow = "% age 18-30 years",
y = "Middle-aged Adults",
yarrow= "% age 31-45 years",
z = "Old Adults",
zarrow = "% age 46-60 years") +
theme(legend.position = "bottom",
legend.box = "horizontal",
tern.axis.title.T = element_text(),
tern.axis.title.L = element_text(hjust = 0.2, vjust = 0.7, angle = -60),
tern.axis.title.R = element_text(hjust = 0.8, vjust = 0.6, angle = 60))
From the makeover plot, we can see the position and size of each education level. High School/ College has the biggest size, followed by Bachelors and Graduate groups, and the size of Low group is much smaller than all the other groups.
Besides, we can see the proportion of participants of different age groups in every education level groups. For example, in Bachelors group, the age composition is roughly ‘Young: Middle-aged: Old = 36:32:33’, which is very even. However, in Low group, it is ‘Young: Middle-aged: Old = 27: 40: 33’.
Pie chart can indicate proportion of values in different groups. In this exercise, we can explore the proportion of participants who have kids and don’t have kids with different household size.
The sketch is shown below:
First, we should change the id of participants into numeric data type and group participants by household size and have/don’t have kids with the count in each subgroups.
p$participantId <- as.numeric(p$participantId)
p$householdSize <- as.character(p$householdSize)
p_new <- p %>% group_by(householdSize, haveKids) %>% count(haveKids)
print(p_new)
# A tibble: 3 x 3
# Groups: householdSize, haveKids [3]
householdSize haveKids n
<chr> <chr> <int>
1 1 Dont Have kids 337
2 2 Dont Have kids 373
3 3 Have kids 301
With the code below (using ggplot2), we can only get individual pie charts and donut plots and those graphs cannot give us a combination of proportion of each group.
ggplot(data=p_new, aes(x=" ", y=n, group=householdSize,
colour=householdSize, fill=householdSize)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start=0) +
facet_grid(.~ haveKids) +theme_void()
To solve the problem of original design, webr package should be used to build a combination of pie chart and donut plot.
With following code, we can draw a pie chart surrounded with a donut plot to show the relation and proportion of each group.
PieDonut(p_new, aes(haveKids, householdSize, count=n),
explode=2,
explodePos = 0.2,
pieLabelSize = 3.5,
donutLabelSize = 3.5,
titlesize = 4,
title = "Proportion of People Who Have/Don't Have Kids
in Different Household Size")
The makeover plot gives us a clearer overview of the number of participants in each group. Interestingly, there are only three types of household with the size of 1, 2 and 3, and all households with the size of three have kids and all households with the size of one and two don’t have kids and the the number of participants in these three groups are very even (nearly 3:3:3).